A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Author: Gonen Alon
Ioffe Sergey
Konda Pradap
Mozafari Barzan
Seung H.
Tan
Tong Simon
Wang Jiannan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/03/2020

Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) upon them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to what active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, #labels and example selection latencies. Our most surprising result is that active learning with fewer labels can learn a classifier of comparable quality to supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10x without affecting the quality of the model.
Comment: accepted for publication in ACM-SIGMOD 2020, 15 pages
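The loop the abstract describes (select an example, ask the Oracle for its label, refit the classifier) can be illustrated with a minimal sketch. This is not the paper's framework: the threshold classifier, the closest-to-boundary selection rule, and the names `fit_threshold` and `active_learn` are assumptions chosen for illustration, with each candidate record pair reduced to a single similarity score in [0, 1].

```python
from typing import Callable, List, Tuple


def fit_threshold(labeled: List[Tuple[float, bool]]) -> float:
    """Toy classifier (assumption): the decision threshold is the midpoint
    between the highest labeled non-match and the lowest labeled match."""
    matches = [score for score, is_match in labeled if is_match]
    non_matches = [score for score, is_match in labeled if not is_match]
    low = max(non_matches) if non_matches else 0.0
    high = min(matches) if matches else 1.0
    return (low + high) / 2.0


def active_learn(
    pool: List[float],
    oracle: Callable[[float], bool],
    seed: List[float],
    budget: int,
) -> Tuple[float, int]:
    """Pool-based active learning with uncertainty sampling.

    Returns the learned threshold and the number of labels requested.
    """
    labeled = [(score, oracle(score)) for score in seed]
    unlabeled = [score for score in pool if score not in seed]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Example selection: the pair closest to the current decision
        # boundary is the one the classifier is least certain about.
        pick = min(unlabeled, key=lambda score: abs(score - threshold))
        unlabeled.remove(pick)
        labeled.append((pick, oracle(pick)))
    return fit_threshold(labeled), len(labeled)


# Hypothetical data: 21 candidate pairs; true matches score at least 0.62.
pool = [i / 20 for i in range(21)]
oracle = lambda score: score >= 0.62
threshold, n_labels = active_learn(pool, oracle, [pool[0], pool[-1]], budget=6)
```

In this toy setting the loop labels 8 of the 21 pairs and still separates every pair in the pool correctly, which mirrors, in miniature, the abstract's point that fewer labels can suffice.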

arXiv.org e-Print Archive

Crossref

Complaint-driven Training Data Debugging for Query 2.0

Author: Abuzaid Firas
Agarwal Alekh
Boehm Matthias
Chapman Adriane
Gilpin Leilani H.
Giordano Ryan
Green Todd J.
Kang Daniel
Kantchelian Alex
Khanna Rajiv
Koh Pang Wei
Konda Pradap
Krishnan Sanjay
Li Yuliang
Matthew
Metsis Vangelis
Rahm Erhard
Ribeiro Marco Túlio
Ré Christopher
Shrikumar Avanti
Sundararajan Mukund
Tanaka Daiki
Xu Jingyi
Zhang Xuezhou
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/04/2020

As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging since an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples so that if they were removed, the complaints would be resolved. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions which both require linear retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.
Comment: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
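To make the cost of the naive solution concrete, the sketch below enumerates subsets of training examples in order of increasing size and retrains after each removal until the complaint is resolved. It is not Rain and does not use influence functions; the function `naive_minimal_removal`, the mean-label "model", and the complaint predicate are all hypothetical stand-ins.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple


def naive_minimal_removal(
    train: List[int],
    fit: Callable[[List[int]], float],
    complaint_resolved: Callable[[float], bool],
) -> Tuple[Optional[Tuple[int, ...]], int]:
    """Brute-force search for a smallest set of training indices whose
    removal resolves the complaint.

    Returns (removed indices or None, number of models trained). In the
    worst case every subset is tried, i.e. 2**len(train) retrainings.
    """
    trained = 0
    for size in range(len(train) + 1):
        for removed in combinations(range(len(train)), size):
            kept = [label for i, label in enumerate(train) if i not in removed]
            model = fit(kept)  # one full retraining per candidate subset
            trained += 1
            if complaint_resolved(model):
                return removed, trained
    return None, trained


def fit_mean(labels: List[int]) -> float:
    """Toy 'model' (assumption): the fraction of positive training labels."""
    return sum(labels) / len(labels) if labels else 0.0


# Hypothetical complaint: the model's positive rate should be below 0.5.
train_labels = [1, 1, 1, 0, 0]
removed, n_trained = naive_minimal_removal(
    train_labels, fit_mean, lambda positive_rate: positive_rate < 0.5
)
```

Even on five examples the search retrains seven times before it finds a two-example removal; the count grows exponentially with the training set size, which is the cost the abstract's linear-retraining heuristics are meant to avoid.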

arXiv.org e-Print Archive

Crossref

Magellan

Author: AnHai Doan
Derek Paulsen
Doan A.
Govind Y.
Kaushik Chandrasekhar
Konda P.
Konda Y.
Matthew Christie
Mudgal S.
Papadakis G.
Paul Suganthan G. C.
Philip Martinkus
Pradap Konda
Yash Govind
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref